feat(kiloclaw): proactively refresh API keys approaching expiry #1049

Merged: pandemicsyn merged 9 commits into main from florian/chore/proactively-refresh-token on Mar 16, 2026

@pandemicsyn pandemicsyn commented Mar 11, 2026

Summary

Instances' API keys (JWTs) have a fixed expiry. Today, if a key expires while a sandbox is running, the gateway loses API access until the next full restart re-mints the key. This PR adds proactive refresh: the reconciliation alarm checks if the key expires within 3 days (configurable via PROACTIVE_REFRESH_THRESHOLD_HOURS wrangler var) and mints a fresh one.
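
The expiry check described above could be sketched as follows. This is an illustration only, not the PR's code: `isApproachingExpiry` and `DEFAULT_THRESHOLD_MS` are hypothetical names, and skipping instances with an unknown expiry is an assumption based on the reviewer note that only instances "with a known expiry" refresh.

```typescript
// Hypothetical constant: the 72h (3 day) default named in the PR description.
const DEFAULT_THRESHOLD_MS = 72 * 60 * 60 * 1000;

// Hypothetical helper: returns true when the key expires within the threshold window.
function isApproachingExpiry(
  expiresAtMs: number | null,
  nowMs: number,
  thresholdMs: number = DEFAULT_THRESHOLD_MS,
): boolean {
  // Assumption: with no known expiry there is nothing to compare against, so skip.
  if (expiresAtMs === null) return false;
  // Already-expired keys yield a negative remainder and therefore also qualify.
  return expiresAtMs - nowMs <= thresholdMs;
}
```

A key expiring in two days would qualify under the default threshold, while one expiring in ten days would not.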

How the fresh key is delivered:

  1. Fly machine config is always updated with skipLaunch (durable persist). This ensures the key survives cold starts regardless of whether the live push succeeds.
  2. Live push to the controller's process.env via POST /_kilo/env/patch is attempted. If the controller supports it, SIGUSR1 triggers a graceful in-process restart in OpenClaw — it drains active tasks (up to 90s), closes the server, and restarts the server loop, which re-reads process.env and picks up the new key.
  3. If the push fails (404 from an old controller, network error, or gateway not signaled), no restart is forced: the Fly config already has the new key, and the machine picks it up on its next natural restart (user-initiated, crash, deploy).

No version gating — capability detection is used. The push is always attempted; a 404 from old controllers is caught and handled gracefully.
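
Capability detection of this kind might look roughly like the sketch below. It is not the PR's implementation: `tryPushEnvPatch`, `PostJson`, `PushResult`, and the request body shape are all assumptions made for illustration. Only the route path, the 404 behavior of old controllers, and the `KILOCODE_API_KEY` name come from the description.

```typescript
type PushResult = "pushed" | "unsupported" | "failed";

// Minimal HTTP shape so the sketch does not depend on a particular fetch implementation.
type PostJson = (path: string, body: unknown) => Promise<{ status: number }>;

// Hypothetical helper: always attempts the push and classifies the outcome.
async function tryPushEnvPatch(post: PostJson, apiKey: string): Promise<PushResult> {
  try {
    // Assumed body shape: a flat object of env var names to values.
    const res = await post("/_kilo/env/patch", { KILOCODE_API_KEY: apiKey });
    // Old controllers do not have the route, so a 404 means "not supported".
    if (res.status === 404) return "unsupported";
    return res.status >= 200 && res.status < 300 ? "pushed" : "failed";
  } catch {
    // Network error: report failure and let the caller carry on without a restart.
    return "failed";
  }
}
```

Because the outcome is derived from the response itself, no controller version constant has to be kept in sync with releases.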

Failure handling:

  • Both paths fail (Fly config update AND push): key/expiry is NOT persisted to DO state, so the next alarm cycle retries.
  • Push fails, Fly config succeeds: key is persisted. Next restart picks it up.
  • Fly config fails, push succeeds: key is persisted. Gateway has it live, but next cold start will need another refresh.
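
The three cases above reduce to "persist only if at least one delivery path succeeded". A minimal sketch of that flow, assuming injected dependencies: `RefreshDeps`, `RefreshOutcome`, and `refreshApiKey` are hypothetical names, and the real `reconcileApiKeyExpiry` may be structured differently.

```typescript
// Hypothetical dependency bundle so the flow can be exercised without Fly or a controller.
interface RefreshDeps {
  mint: () => Promise<{ key: string; expiresAt: string }>;
  updateFlyConfig: (key: string) => Promise<void>; // assumed to use skipLaunch internally
  pushToController: (key: string) => Promise<boolean>; // true when the live push succeeded
  persist: (key: string, expiresAt: string) => Promise<void>; // writes to DO state
}

interface RefreshOutcome {
  pushed: boolean;
  flyConfigUpdated: boolean;
  persisted: boolean;
}

async function refreshApiKey(deps: RefreshDeps): Promise<RefreshOutcome> {
  // A mint failure propagates; nothing was delivered, so the next alarm simply retries.
  const { key, expiresAt } = await deps.mint();

  // Durable path first, so the key survives cold starts even if the push fails.
  let flyConfigUpdated = false;
  try {
    await deps.updateFlyConfig(key);
    flyConfigUpdated = true;
  } catch {
    // Swallowed on purpose: the live push may still succeed.
  }

  // Live path second; a failure here never triggers a restart.
  let pushed = false;
  try {
    pushed = await deps.pushToController(key);
  } catch {
    // Swallowed on purpose: the Fly config may already hold the new key.
  }

  // Persist only when at least one delivery path succeeded.
  const persisted = pushed || flyConfigUpdated;
  if (persisted) await deps.persist(key, expiresAt);
  return { pushed, flyConfigUpdated, persisted };
}
```

When both paths fail, the stored expiry stays unchanged, so the same threshold check fires again on the next alarm cycle.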

What changed:

  • Controller endpoint (POST /_kilo/env/patch): accepts an allowlisted set of env vars (KILOCODE_API_KEY), writes them to process.env, and sends SIGUSR1 to the gateway. Bearer-auth gated same as existing /_kilo/config/* routes.
  • Fly client (updateMachine): added skipLaunch option — updates machine config without restarting.
  • Reconciliation (reconcileApiKeyExpiry): new step wired after reconcileVolume. Flow: mint → update Fly config (skipLaunch) → try push → persist only if at least one path succeeded. minSecretsVersion forwarded from ensureEnvKey() to prevent secret propagation races.
  • Config (getProactiveRefreshThresholdMs): reads PROACTIVE_REFRESH_THRESHOLD_HOURS wrangler var with fallback to 72h default. Set to a large value (e.g. 8760 = 1 year) to trigger refresh on all running instances for testing.
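
The allowlist check in the controller endpoint might look roughly like this. It is a sketch rather than the route's actual code: `validateEnvPatch`, `ALLOWED_ENV_KEYS`, `ValidationResult`, and the error strings are hypothetical, and only the `KILOCODE_API_KEY` allowlist entry comes from the description above.

```typescript
// Only keys in this set may be written to process.env (single entry per the PR).
const ALLOWED_ENV_KEYS: ReadonlySet<string> = new Set(["KILOCODE_API_KEY"]);

type ValidationResult =
  | { ok: true; validated: Record<string, string> }
  | { ok: false; error: string };

// Hypothetical validator: rejects the whole request if any key or value is unacceptable.
function validateEnvPatch(body: unknown): ValidationResult {
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    return { ok: false, error: "body must be a JSON object" };
  }
  const validated: Record<string, string> = {};
  for (const [key, value] of Object.entries(body)) {
    if (!ALLOWED_ENV_KEYS.has(key)) {
      return { ok: false, error: `key not allowed: ${key}` };
    }
    if (typeof value !== "string" || value.length === 0) {
      return { ok: false, error: `value for ${key} must be a non-empty string` };
    }
    validated[key] = value;
  }
  if (Object.keys(validated).length === 0) {
    return { ok: false, error: "no keys provided" };
  }
  return { ok: true, validated };
}
```

Rejecting the request outright on an unknown key, rather than silently dropping it, keeps a mistyped variable name from looking like a successful patch.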

Verification

  • pnpm typecheck — passes
  • pnpm test — 566/566 tests pass (30 test files), including 26 new tests across 5 new/modified test files
  • pnpm lint — passes
  • Additional verification (manual testing, staging deploy, etc.)

Visual Changes

Old controller:

Screenshot 2026-03-12 at 11 52 06 AM

New controller:

{"tag":"reconcile","reason":"alarm","action":"api_key_refreshed","user_id":"7c2f4f32-1ef0-43cb-84bb-0b51ad5eb7bf","new_expires_at":"2026-04-15T13:37:07.000Z","pushed":true,"flyConfigUpdated":true,"controller_version":"2026.3.12"}

Reviewer Notes

  • No forced restarts. The refresh process never causes downtime. If the live push fails, the machine keeps running with the old key until it naturally restarts, at which point it boots with the fresh key from Fly config.
  • Capability detection replaces version gating: the push to /_kilo/env/patch is always attempted. Old controllers return 404, which is caught gracefully. No manual version constant to maintain.
  • The Fly config update always uses skipLaunch: true with minSecretsVersion from ensureEnvKey().
  • Key/expiry is only persisted to DO state when at least one delivery path succeeded (pushed || flyConfigUpdated). If both fail, the next alarm retries.
  • Promise.race for the mint timeout clears the timer on success. Chosen over AbortSignal.timeout because Hyperdrive doesn't propagate abort signals.
  • To test in staging: set PROACTIVE_REFRESH_THRESHOLD_HOURS to 8760 (1 year) in wrangler vars — every running instance with a known expiry will refresh on its next alarm cycle (within 5 min).
  • Structured log events for dashboarding: filter on tag:"reconcile" AND action:"api_key_*". Key events: api_key_refreshed (with pushed and flyConfigUpdated fields), api_key_push_error, api_key_refresh_failed_all_paths.
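
The `Promise.race` timeout pattern mentioned in the notes, with the timer cleared once the race settles, can be sketched generically as below. `withTimeout` is a hypothetical helper name and the error message is illustrative; the PR's mint code may differ in detail.

```typescript
// Races `work` against a timer and always clears the timer afterwards,
// so a fast success does not leave a pending timeout behind.
async function withTimeout<T>(work: Promise<T>, timeoutMs: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Runs on success, on failure of `work`, and on timeout.
    clearTimeout(timer);
  }
}
```

Note that losing the race does not cancel `work`; the underlying operation keeps running and its result is simply ignored.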

The reconciliation alarm now checks if the instance's API key expires
within 7 days and, if the controller supports it, mints a fresh key,
pushes it via the new /_kilo/env/patch endpoint, updates the Fly machine
config (without restart via skip_launch), and persists the new expiry.

Key changes:
- Controller: POST /_kilo/env/patch with KILOCODE_API_KEY allowlist
- Fly client: skip_launch option on updateMachine
- Reconcile: reconcileApiKeyExpiry with version gating, mint, push, persist
- Config: isCalverAtLeast helper and PROACTIVE_REFRESH_THRESHOLD_MS constant
…controllers

Restructure reconcileApiKeyExpiry so the version check only gates the
push-to-controller step, not the entire flow. When the controller is
too old for /_kilo/env/patch, we still mint a fresh key, update the
Fly machine config (triggering a restart), and persist to DO state.

Also:
- Reduce PROACTIVE_REFRESH_THRESHOLD_MS from 7 days to 3 days
- Guard against starting stopped machines (check Fly machine.state
  before deciding skipLaunch)
- Add test for stopped-machine safety guard

kilo-code-bot bot commented Mar 11, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

| Severity | Count |
|---|---|
| CRITICAL | 0 |
| WARNING | 1 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
|---|---|---|
| kiloclaw/src/config.ts | 82 | PROACTIVE_REFRESH_THRESHOLD_HOURS accepts values at or above the 30-day token lifetime, which makes reconcile mint and push a new key on every alarm. |

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

N/A

Files Reviewed (15 files)
  • kiloclaw/controller/src/index.ts - 0 issues
  • kiloclaw/controller/src/routes/env.test.ts - 0 issues
  • kiloclaw/controller/src/routes/env.ts - 0 issues
  • kiloclaw/scripts/controller-smoke-test.sh - 0 issues
  • kiloclaw/src/config.test.ts - 0 issues
  • kiloclaw/src/config.ts - 1 issue
  • kiloclaw/src/durable-objects/gateway-controller-types.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/gateway.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts - 0 issues
  • kiloclaw/src/fly/client.test.ts - 0 issues
  • kiloclaw/src/fly/client.ts - 0 issues
  • kiloclaw/src/types.ts - 0 issues
  • kiloclaw/worker-configuration.d.ts - 0 issues
  • kiloclaw/wrangler.jsonc - 0 issues

Reviewed by gpt-5.4-20260305 · 669,096 tokens

- Reorder: update Fly config (skipLaunch) before hot patch attempt so
  the key is durably persisted before we try the live push
- Forward minSecretsVersion from ensureEnvKey() to updateMachine to
  prevent secret propagation races on restart
- Use updateMachine without skipLaunch for restart instead of
  stop+start to avoid leaving the machine stopped on partial failure
- Only persist new key/expiry to DO state when at least one delivery
  path succeeded (push or Fly config update)
- Make refresh threshold configurable via PROACTIVE_REFRESH_THRESHOLD_HOURS
  wrangler var (default 72h) for testing
- Reduce default threshold from 7 days to 3 days

Remove MIN_ENV_PATCH_CONTROLLER_VERSION, isCalverAtLeast(), and the
getControllerVersion() pre-flight check. Instead, always try the push
to /_kilo/env/patch — if the controller returns 404 (old image), the
catch block handles it and falls through to the restart path.

This eliminates a manually maintained calver constant that had to
match the controller release date. The cost is one extra HTTP call
per refresh event on old controllers (the 404), which is negligible
since refresh only triggers once per key expiry cycle.

Never force-restart the machine during key refresh. The Fly config is
updated with skipLaunch (durable persist), the push is attempted for
live delivery, and if the push fails the machine picks up the new key
on its next natural restart (user-initiated, crash, deploy).

This avoids any risk of downtime caused by the refresh process itself.
…t env patch

- Fetch controller version (best-effort) during API key refresh and
  include it in api_key_expiry_approaching, api_key_push_error, and
  api_key_refreshed log events for observability
- Add env patch endpoint checks to controller-smoke-test.sh
…vely-refresh-token

# Conflicts:
#	kiloclaw/src/durable-objects/kiloclaw-instance/gateway.ts
}

for (const [key, value] of Object.entries(validated)) {
  process.env[key] = value;

WARNING: The hot-patched key leaves KILO_API_KEY stale when the Kilo CLI feature is enabled

start-openclaw.sh aliases KILOCODE_API_KEY into KILO_API_KEY before launching the controller, but this route only updates process.env.KILOCODE_API_KEY. After SIGUSR1, the supervised gateway child respawns from the controller's current environment, so it still inherits the old KILO_API_KEY and the Kilo CLI auth plugin keeps using the expired token.

pandemicsyn (author) replied:

I'll fix up the CLI in a follow-up.

…vely-refresh-token

# Conflicts:
#	kiloclaw/controller/src/index.ts
if (envOverrideHours) {
  const hours = Number(envOverrideHours);
  if (!Number.isNaN(hours) && hours > 0) {
    return hours * 60 * 60 * 1000;

WARNING: Thresholds at or above the token lifetime trigger refresh on every alarm

mintFreshApiKey() always issues 30-day tokens (KILOCODE_API_KEY_EXPIRY_SECONDS). Because this helper accepts any positive hour value, setting PROACTIVE_REFRESH_THRESHOLD_HOURS to 720 or larger means a newly minted key is still inside the threshold on the next 5-minute reconcile, so the worker will mint and push a fresh token forever. Clamp the override below the token lifetime or fall back to the default before returning it.
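
One way to apply the suggested clamp is sketched below, taking the "fall back to the default" option. This is not the merged code: `getClampedRefreshThresholdMs` and both constants are hypothetical names, and the 720-hour lifetime is taken from the 30-day figure in the comment above.

```typescript
const DEFAULT_THRESHOLD_HOURS = 72; // default named in the PR description
const TOKEN_LIFETIME_HOURS = 30 * 24; // 720h, per the 30-day token lifetime cited above

// Hypothetical variant of the threshold helper that ignores overrides
// at or above the token lifetime instead of accepting any positive value.
function getClampedRefreshThresholdMs(envOverrideHours: string | undefined): number {
  const hourMs = 60 * 60 * 1000;
  if (envOverrideHours) {
    const hours = Number(envOverrideHours);
    // Number.isFinite also rejects NaN and Infinity.
    if (Number.isFinite(hours) && hours > 0 && hours < TOKEN_LIFETIME_HOURS) {
      return hours * hourMs;
    }
  }
  return DEFAULT_THRESHOLD_HOURS * hourMs;
}
```

A trade-off worth noting: with this guard, the 8760-hour staging override described in the PR would silently fall back to 72 hours, so a test-only escape hatch would need a separate mechanism.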

@pandemicsyn pandemicsyn merged commit f8c3ff9 into main Mar 16, 2026
18 checks passed
@pandemicsyn pandemicsyn deleted the florian/chore/proactively-refresh-token branch March 16, 2026 15:15